In this module, we will examine the MNIST dataset, which is a set of 70,000 images of digits handwritten by high school students and employees of the US Census Bureau.
MNIST is considered the “hello-world” of the machine-learning world, and is often a good place to start for understanding classification algorithms.
Let’s load the MNIST dataset.
library(MicrosoftML)
library(tidyverse)
library(magrittr)
library(dplyrXdf)
theme_set(theme_minimal())
mnist_xdf <- file.path("..", "data", "MNIST.xdf")
mnist_xdf <- RxXdfData(mnist_xdf)
Let’s take a look at the data:
rxGetInfo(mnist_xdf)
File name: /home/alizaidi/learnAnalytics-MicrosoftML/Student-Resources/data/MNIST.xdf
Number of observations: 70000
Number of variables: 786
Number of blocks: 7
Compression type: zlib
Our dataset contains 70,000 records and 786 columns. Of these, 784 are pixel features, because each image is 28x28 pixels and 28 x 28 = 784. The two remaining columns hold the label and a pre-sampled train/test split.
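The column arithmetic above can be checked with a small base-R sketch. The vector `pixels` is made-up data standing in for one row of pixel columns, and the row-major reshape is an assumption about how the images were flattened, not something the file description confirms.

```r
# Dimensions stated in the text
img_side   <- 28
n_features <- img_side * img_side   # 784 pixel columns
n_columns  <- n_features + 2        # plus the label and the train/test split

# Hypothetical stand-in for one image row: 784 intensities in [0, 255]
set.seed(1)
pixels <- sample(0:255, n_features, replace = TRUE)

# Reshape the flat vector into a 28 x 28 matrix (row-major is an assumption)
img <- matrix(pixels, nrow = img_side, ncol = img_side, byrow = TRUE)
dim(img)
```

With a real row from the data, the same reshape would be the starting point for drawing the digit, provided the pixel columns are selected in their original order.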
Let’s make some visualizations to examine the MNIST data and see which features might help a classifier tell the digits apart.
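# Read the whole XDF file into memory as a tibble. maxRowsByCols is raised
# to rows * columns so that the in-memory import is not truncated.
mnist_df <- rxDataStep(inData = mnist_xdf, outFile = NULL,
                       maxRowsByCols = nrow(mnist_xdf) * ncol(mnist_xdf)) %>% tbl_df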
Let’s compute the average pixel intensity of each image and add it to the data as a new column:
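# Average the numeric columns of each row (one row per image), then
# attach the result as an `intensity` column. %T>% prints the tibble
# and still passes it on to the assignment.
mnist_df %>%
  keep(is.numeric) %>%
  rowMeans() %>% data.frame(intensity = .) %>%
  tbl_df %>%
  bind_cols(mnist_df) %T>% print -> mnist_df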
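To go from per-image intensities to one summary value per digit, the intensities can be grouped by the label. The following is a minimal base-R sketch on a toy data frame, since the name of the label column in MNIST.xdf is not shown here; `label` and the pixel columns `p1` to `p4` are hypothetical names.

```r
# Toy stand-in for the MNIST frame: 6 "images" of 4 pixels each
toy <- data.frame(
  label = c(0, 0, 1, 1, 7, 7),          # hypothetical label column
  p1 = c(0,   10,  0,   0,   20, 40),
  p2 = c(255, 245, 0,   0,   20, 40),
  p3 = c(255, 245, 100, 120, 20, 40),
  p4 = c(0,   10,  100, 120, 20, 40)
)

# Per-image mean intensity over the pixel columns only
pixel_cols <- setdiff(names(toy), "label")
toy$intensity <- rowMeans(toy[, pixel_cols])

# Per-digit mean of the per-image intensities
per_digit <- aggregate(intensity ~ label, data = toy, FUN = mean)
per_digit
```

Selecting the pixel columns by name keeps the label out of the average, which matters if the label happens to be stored as a number rather than a factor.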